Mon. Not. R. Astron. Soc. 000, 1-12 (201 1) Printed 2 August 201 1 (MN KTgX style file v2.2) 



A fast recursive coordinate bisection tree 
for neighbour search and gravity 

Emanuel Gafton* & Stephan Rosswog 

Jacobs University Bremen, Campus Ring 1, 28759, Bremen, Germany 



o 



(N 



Or 

6 
h 



> 
oo 

o 
q 

o 



Accepted 201 1 July 28. Received 201 1 July 14; in original form 201 1 May 12 



ABSTRACT 

We introduce our new binary tree code for neighbour search and gravitational force calcula- 
tions in an Af -particle system. The tree is built in a 'top-down' fashion by 'recursive coordinate 
bisection' where on each tree level we split the longest side of a cell through its centre of mass. 
This procedure continues until the average number of particles in the lowest tree level has 
dropped below a prescribed value. To calculate the forces on the particles in each lowest-level 
cell we split the gravitational interaction into a near- and a far-field. Since our main intended 
applications are SPH simulations, we calculate the near-field by a direct, kernel-smoothed 
summation, while the far field is evaluated via a Cartesian Taylor expansion up to quadrupole 
order. Instead of applying the far-field approach for each particle separately, we use another 
Taylor expansion around the centre of mass of each lowest-level cell to determine the forces 
at the particle positions. Due to this 'cell-cell interaction' the code performance is close to 
(N) where N is the number of used particles. We describe in detail various technicalities 
that ensure a low memory footprint and an efficient cache use. 

In a set of benchmark tests we scrutinize our new tree and compare it to the 'Press tree' that we 
have previously made ample use of. At a slightly higher force accuracy than the Press tree, our 
tree turns out to be substantially faster and increasingly more so for larger particle numbers. 
For four million particles our tree build is faster by a factor of 25 and the time for neighbour 
search and gravity is reduced by more than a factor of 6. In single processor tests with up 
to 10 8 particles we confirm experimentally that the scaling behaviour is close to o(N). The 
current Fortran 90 code version is OpenMP-parallel and scales excellently with the processor 
number (=24) of our test machine. 



Key words: methods: numerical 
namics. 



methods: N-body simulations - gravitation - hydrody- 



1 INTRODUCTION 

Self-gravity is a vital ingredient in the simulation of many astro- 
physical systems. Due to its long-range nature, self-gravity often 
becomes the major computational burden in simulations of A-body 
systems or self-gravitating fluids. While many methods exist to cal- 
culate gravitational forces under various circumstances (e.g. Hock- 
ney & Eastwood 1988), for general self-gravitating systems without 
particular symmetries hierarchical methods seem to be best suited. 
The most popular such approaches are tree methods (Barnes & Hut 
1986; Press 1986; Hernquist & Katz 1989; Benz et al. 1990; Couch- 
man 1991; Couchman et al. 1995; Dubinski 1996; Stadel 2001; 
Wadsley et al. 2004; Springel 2005; Nelson, Wetzstein & Naab 
2009), and -used to a lesser extent in astrophysics- the fast multi- 
pole method (FMM) originally suggested by Greengard & Rokhlin 
(1987). While tree methods restrict the order of their inherent mul- 
tipole expansion (usually to quadrupole order) and open up fur- 
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ther nodes for higher accuracy, the FMM instead fixes an accept- 
able error measure and then calculates whatever multipole order is 
necessary to achieve it. According to the common consensus, tree 
methods are advantageous for problems where a moderate force 
accuracy can be accepted and where one instead invests available 
computational resources in larger particle numbers, while the FMM 
shows its true strength at higher accuracy. There has, however, been 
recent progress in implementations of error-controled variants of 
the FMM (Dachsel 2010) and therefore the above consensus may 
need to come under renewed scrutiny. 

Hybrid methods combining elements of both trees and the FMM 
have also been developed and successfully tested. For example, 
Dehnen (2000, 2002) has developed a fast tree method that borrows 
from the FMM the idea of Taylor-expanding the fields at the 'sink 
cells' in order to calculate the accelerations at individual particle 
positions. This 'cell-cell-interaction' ensures a numerical complex- 
ity proportional to 0(N) rather than o(NlogN) like standard tree 
methods (N being the number of particles). 

Other Poisson solvers, such as the particle-particle-particle-mesh 
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(P3M) method, share the same philosophy of approximating forces 
due to distant particles while computing nearby interactions di- 
rectly, and exhibit the same o(NlogN) complexity. Although not 
so widely used by the stellar astrophysics community, P3M codes 
have been successfully employed in a large variety of cosmological 
simulations (Efstathiou et al. 1985; Couchman 1991; Macfarland 
et al. 1998) and also combined with SPH (Evrard 1988). Hybrid 
methods combining particle-mesh methods with trees, so-called 
TreePM methods, have also been extensively used in cosmologi- 
cal simulations (Xu 1995; Bode et al. 2000; Bagla 2002; Bagla & 
Ray 2003; Springel 2005). 

Here, we describe in detail our newly developed recursive coordi- 
nate bisection (RCB) tree. It will replace an optimized version of 
the Press tree (Press 1986; Benz, Thielemann & Hills 1989) that 
we have used for years in our existing codes (Rosswog & Davies 
2002; Rosswog & Price 2007; Rosswog et al. 2008). Our new tree 
will also become a core ingredient of a new SPH code that is cur- 
rently under development. Since each of these codes has its own 
peculiarities in terms of input physics, time integration schemes 
etc., we restrict ourselves in this paper to a self-contained descrip- 
tion of our tree module (methods and implementation details) and 
defer questions such as time integration etc. to a later point. The 
main purpose of this paper is to document in detail the entity of 
our tree ingredients and their interplay for future reference. Some 
of our tree ingredients have also been used in other tree implemen- 
tations, this will be discussed further in Sec. 2.5. 
Our tree is designed to efficiently find neighbour particles for 
smoothed particle hydrodynamics (SPH) calculations and for the 
fast calculation of gravitational forces. We were guided by applica- 
tions which do not exhibit any particular symmetries, such as stellar 
collisions and tidal disruptions of stars by black holes. In simula- 
tions that involve black holes particles are frequently absorbed near 
the horizon, and therefore just 'repairing' an existing tree becomes 
difficult. Instead, such simulations usually require a frequent and 
complete re-building of the tree from scratch. Therefore an effi- 
cient tree building phase is crucial for our intended purposes. This 
is part of the reason why we decided for an RCB tree during the 
code design phase. 

Our code has been written from scratch in clean Fortran 90 and does 
not make use of any external libraries. It is hardware- and platform- 
independent and as versatile as possible, making no assumptions 
about the problem type or the particle distribution. It particularly 
aims at scaling well on a large number of processors (for now using 
OpenMP) and at being memory efficient. It implements a number 
of optimization techniques that we describe in detail below. As we 
will demonstrate, our tree scales close to O (N) for large particle 
numbers. 

The remainder of the paper is structured as follows. In Sec. 2 we 
describe in detail how we build and walk the tree, and how we use it 
for efficient neighbour search and gravity calculations. In Sec. 3 we 
present a set of challenging benchmark tests where we compare our 
new results against those obtained with the so-called 'Press tree' 
(Press 1986; Benz, Thielemann & Hills 1989) that is widely used 
in astrophysics (e.g. Benz et al. 1990; Bate, Bonnell & Price 1995; 
Bonnell, Bate & Vine 2003; Price & Bate 2009; Nelson et al. 2009) 
and that we have frequently used ourselves (Rosswog et al. 1999; 
Rosswog & Davies 2002; Rosswog 2005; Price & Rosswog 2006; 
Rosswog, Ramirez-Ruiz & Hix 2009; Dan et al. 2011). In Sec. 4 
we summarize the main features of our tree method. 



2 A RECURSIVE COORDINATE BISECTION (RCB) 
TREE 

The underlying idea of a tree method is to collect particles into hi- 
erarchically organized groups (sometimes we also refer to them as 
'nodes' or via their enclosing 'cells') so that expensive tasks can be 
performed on aggregated quantities of the groups rather than on in- 
dividual particles. For astrophysical purposes, a tree method needs 
the following ingredients: i) a strategy how to aggregate particles 
into a hierarchy of groups ('tree build'), ii) composition formulae to 
calculate the properties of a 'mother' cell from the properties of its 
'daughters', iii) an 'opening criterion' that decides while combing 
through the tree whether a node can be accepted 'as is' or needs to 
be opened up into its lower level constituents, and finally iv) a pre- 
scription how to calculate forces from a list of aggregated nodes. 
Particularly popular in astrophysics are the Barnes-Hut octree, 
henceforth 'BH tree' (Barnes & Hut 1986; Hernquist 1987; Turner 
et al. 1995; Springel 2005; Merlin et al. 2010) and the mutual near- 
est neighbour binary tree due to Press (Press 1986; Benz et al. 
1989). 

The method of choice for constructing our tree is 'recursive coordi- 
nate bisection' . Initially all particles are grouped in a root cell and 
in subsequent steps the longest side of each cell is split through 
its centre of mass. This leaves an approximate balance between 
the particle numbers in each of the two resulting daughter cells. 
The process is repeated until the average particle number per cell 
has dropped below some pre-defined, empirically-optimised limit, 
that we will later refer to as Nu. Our procedure results in a bi- 
nary tree structure that is extremely fast to build up. Such trees are 
frequently used in computer science and are often referred to as fed- 
trees (Bentley 1975). To our knowledge they have only been used 
in astrophysics in the PKDGRAV code (Stadel 2001) which later be- 
came the basis of GASOLINE (Wadsley et al. 2004). In the way we 
build up our tree it delivers an adaptive mesh structure tailored to 
the particle distribution. The cells' labels carry information about 
local proximity so that the tree structure can be naturally and effi- 
ciently used to sort particles according to their spatial distribution. 
Similar techniques such as Peano or Morton space-filling curves 
are often used in particle methods to improve the computational 
speed via enhanced cache-coherence. 

In the remainder of this section we describe in detail how we build 
up our tree (Sec. 2.1), how we traverse it (Sec. 2.2) and search 
for neighbours (Sec. 2.3). The gravity calculations are described 
in Sec. 2.4 and in Sec. 2.5 we compare our approach with other 
work. 



2.1 Tree build 

2.1.1 Strategy 

A tree build can either follow a 'top-down' strategy where a 
root node that contains all particles is successively tesselated into 
smaller entities ('nodes' or 'cells'), or a 'bottom-up' approach, 
where, starting on the level of individual particles groups are 
formed that are subsequently collected into groups of groups and 
so on. The BH tree is an example of a top-down tree, the Press 
tree is a bottom-up tree. Building a tree in a top-down approach is 
substantially faster (Makino 1990), but it may generate situations 
in which particles that are well separated in space are artificially 
placed in the same node, while nearby particles are placed in differ- 
ent nodes. The factors that minimize the chances of this happening 
are the degree of decomposition and the load balancing algorithm, 
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Median position split (MPS) Centre of mass split (CMS) 




Figure 1. A 2D illustration of two ways to split a given cell: the median 
position split (MPS, left) and the centre of mass split (CMS, right). The 
MPS leads to perfectly balanced daughter nodes, but can artificially separate 
particles that naturally belong to the same group, as illustrated by particle 
a. Since we calculate our gravitational forces by a Taylor expansion around 
the centre of mass, the MPS can lead to large force errors, therefore we 
prefer the CMS for building our tree. 



which together define the prescription how a node is split into its 
children. 



In passing we note that one does not necessarily have to split a cell 
through its centre of mass, although this is convenient for our ap- 
plications. If systems with charges of different signs are simulated, 
say in plasma physics, it may be more convenient to split through 
the 'centre of number' (by simply replacing the particle masses by 
a weight of ' 1 '), and then use the proper charges for the calculation 
of the needed multipoles. 

By choosing the CMS rather than the MPS we pay the price that 
empty lowest-level cells ('11-cells') can, at least in principle, occur. 
The CMS procedure will often assign different numbers of particles 
to sibling cells and therefore one can eventually end up with empty 
11-cells at the expense of other 11-cells being over-populated. The 
latter will be very compact (since high densities are the reason why 
imbalanced distributions are obtained in the first place), and the 
multipole expansion will converge. At the same time, empty cells 
(less than 1% of the total number of cells even for pathological 
cases) can be safely ignored, since they do not contain neighbours 
and do not contribute to gravity either. 



2.1.2 Degree of decomposition 

Octrees (quadtrees in 2D) split a cell into eight (four in 2D) daugh- 
ter cells, binary trees only in two. Therefore octrees are = 3 
times shorter than binary trees and usually faster to construct. But 
by forcing all sides to be split, they tend to create elongated daugh- 
ter cells out of elongated parent cells. This is usually not desirable 
since such cells can have large high-order multipole moments and 
can thus introduce large truncation errors in the force calculation. In 
our binary tree approach we divide a node through its longest edge, 
which tends to drive daughter cells closer to the desired, compact 
shape. Binary trees also tend to return shorter interaction lists for a 
given accuracy (Anderson 1999; Waltz et al. 2002), which reduces 
the number of direct force evaluations and thus speeds up the grav- 
ity calculations. 

2. 1.3 Load balancing 

An important concern is keeping the decomposition of a tree bal- 
anced in the sense that sibling nodes contain comparable amounts 
of particles. Most implementations of the BH tree are spatially- 
balanced, cutting the edge of a cell through its middle. While this 
method is extremely fast because one only has to compute the aver- 
age of two coordinates, it can lead to very uneven particle distribu- 
tions. Density-balanced trees, on the other hand, ensure that com- 
parable computational domains remain on each side after the cut. 
In the design phase of our RCB tree, we have experimented with 
a median position split (MPS) and a centre of mass split (CMS), 
see Fig. 1. The MPS produces perfectly balanced particle numbers 
in each daughter cell, but it can lead to an artificial separation of 
particles that one would consider as belonging to the same group 
into different nodes. This is more than just an aesthetic flaw, since 
due to our Taylor expansion around the cell centre of mass, see Sec. 
2.4, particles that are artificially separated from their natural group 
of peers (e.g. particle a in Fig. 1, left) suffer substantially larger 
truncation errors for the MPS than for the CMS case. In our experi- 
ments we found that the relative force accuracy of such 'renegades' 
can be improved by orders of magnitude if the CMS is used instead. 
Another example would be a stellar binary system containing dif- 
ferent numbers of particles per star. The CMS assigns each star its 
own node, while the MPS would place some of the particles from 
the heavier star in the node of the lighter star, creating elongated 
cells and substantially larger multipole truncation errors. 



2.1.4 Particle sorting 

One technique that is frequently used in particle methods in order 
to reduce the number of cache misses is re-ordering the particles 
in memory based on their physical proximity (e.g. Springel 2005; 
Thacker & Couchman 2006; Nelson et al. 2009). In experiments 
of stellar collisions we had found that particle sorting can easily 
improve overall runtimes by a factor of a few, concluding that the 
overhead due to sorting is certainly worthwhile the extra effort. The 
RCB tree uses a customised partitioning algorithm inspired by the 
quick-sort method (Hoare 1962) to sort the particle arrays. 
We first compute the centre of mass of a node, then decide which 
direction x,- to split, and subsequently perform a single iteration 
through all the particles in the node. During this iteration, we move 
all particles that have the x,- coordinate smaller than that of the cen- 
tre of mass at the beginning of the array, and all particles with larger 
x; at the end of the array. The boundary between these two subsets 
then tells us where the particles of the left child node end and those 
of the right child node begin. There is no need to sort these two 
subsets of particles any further because on the next level one might 
have to split along a different side. 

One advantage of sorting the particles is that it eliminates the need 
to keep a list of all the particles in an 11-cell, the indices of the first 
and the last particles being sufficient (this actually holds true for 
any node, since particles are 'sorted' on each level). In addition, 
looping through the particles in any one node requires no jumps 
through the memory, and hence the array operations (such as the 
computation of the centre of mass) prove to be extremely cache- 
efficient. 

It is known that typical quicksort implementations have a worst- 
case complexity of o(N 2 ) (Press et al. 1992, Sec. 8.2), usually 
occuring if one repeatedly uses poorly-chosen pivots on already- 
sorted arrays. In our case, we compute the centre of mass anyway 
(since we later need it in the gravity calculations) so the pivot ele- 
ment is always optimally chosen. Since on each of the logN levels 
we traverse the arrays just once (O (A')), the complexity of this sub- 
routine is always o(NlogN). As will be shown below, see Fig. 14, 
the tree build takes just a small fraction of the CPU-time (less than 
a percent) in comparison to neighbour search and gravity, and the 
sorting, in turn, is just a small fraction of the tree build. The in- 
vestment of time in sorting is completely negligible but speeds up 
the tree build subroutine substantially. We also gain nearly a factor 
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Figure 2. Node labelling convention in the RCB tree. Many useful relations 
can be trivially recovered by simple arithmetic operations on the node labels 
without any need for additional memory consumption. See main text for 
details. 

three by substantially enhanced cache access during force evalua- 
tions. 

2.1.5 Node labelling 

A well-chosen node labelling system eliminates the need for point- 
ers that link parent nodes and their children. In the RCB tree, in- 
stead of storing the nodes as the components of a linked list (Du- 
binski 1996; Dehnen 2002; Nelson et al. 2009), we simply use a 
one-dimensional array, with the relationship between the nodes en- 
coded in their indices. 

Our node labelling conventions are illustrated in Fig. 2. For a tree 
with k levels the total number of nodes is 2 k — 1. The best choice is 
to assign the root node the label T, place it on level '0', and then 
continue to increment the labels level-by-level. With this labelling 
convention, simple 'index gymnastics' allows to recover various 
node relations without additional memory consumption. For exam- 
ple: 

(1) the children of node n are 2w and 2n+ 1; 

(2) the level of cell n is [log 2 «] (GauB bracket); 

(3) the first node on level k is 2 k \ 

(4) the last node on level k is 2 k+l — 1; 

(5) the number of cells on level k is 2 k ; 

(6) the total number of descendants of a cell on level k is 2 P — 2, 
where p is the number of levels greater than or equal to k. 

The last relation is used, for example, when entire branches are dis- 
carded during the tree walk. Fig. 3 illustrates how the node labels 
relate to the spatial tesselation for a two-dimensional particle dis- 
tribution. 

2.1.6 Multipole moments 

During the tree building phase, a number of physical quantities are 
calculated and stored for each node: position of the lowest-left cor- 
ner, size, coordinates of the geometrical centre and of the centre of 
mass, and the multipole moments. For the gravitational forces from 
distant nodes we use a simple Cartesian multipole expansion up to 
quadrupole order. Since the dipole moment vanishes when the ori- 
gin of the multipole expansion coincides with the centre of mass, 
one only needs to store the monopole and the quadrupole moments. 
The monopole is simply the mass of all the particles in the cell and 
can be computed very fast with the SUM command, since all parti- 
cles of a given node, having been sorted, are contiguous in mem- 
ory. For the 11-cells one computes the components of the traceless 
quadrupole tensor due to each individual particle a, 




Figure 3. Illustration of the spatial splitting and labelling of cells on differ- 
ent levels for a two-dimensional particle distribution. Dashed lines (planes 
in 3D) are perpendicular to the longest cell side and cross the centres of 
mass. The tesselation ends once the average number of particles per lowest- 
level cell has dropped below a prescribed value, Nu. 



Qfj = 3m a AxfAx a j - m a (Ar a ) 2 5f y , (1) 

where m a is the particle mass, Ax% are the components of the parti- 
cle a's offset from the cell centre of mass and (Ar fl ) 2 = £,-(Ax? ) 2 - 
Subsequently, these quadrupole moments are 'shifted up' to the 
parent nodes. Specifically, the quadrupole moments Qfj of a par- 
ent node P are computed in terms of the quadrupole moments Qfj 
of its children C as: 

Qu = L N + 3MCAx i - MC ( ArC ) 2 Su} , (2) 

c 

where M c is the mass of child C, Axp is the distance between the 
centres of mass of the parent and of its child along the axis x,-, 
and Ar c is the distance between their centres of mass: (Ar c ) 2 = 
I/(Axf) 2 . 



2.1.7 Parallelisation 

Since each level of the tree depends on the previous levels (the par- 
ticles must be sorted in the parent node before the children nodes 
can be created) simple OpenMP work sharing constructs will not 
work out of the box. Instead, we use an MPI-like approach: assum- 
ing that 2 k processors are to be used for the tree build, one builds 
the tree down to level k on just one processor, and then 'distributes' 
each of the 2 k nodes on level k to one processor. Since the particles 
are sorted down to level k, each processor can now build its own 
'sub-tree' by simply ignoring the rest of the tree and acting as if its 
assigned node were the root node. 
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Figure 4. The order in which tree nodes are accessed during a tree walk. 
Jumps between non-neighbouring cells are indicated by curved arrows. 
Note that the original connection between a parent cell and its right daugh- 
ter, marked with dashed lines, is no longer significant for the tree walk. All 
tree walks begin with node 2. 

2.1.8 Updating the tree 

Although heavily optimised and never taking more than 1 % of the 
total computational time, the tree build is still an expensive oper- 
ation that should be avoided when possible. For this reason, the 
integrity of the tree is checked after every time step: if the particles 
have not moved out of their 11-cells then the tree does not need to 
be rebuilt. It is sufficient to simply update the relevant quantities of 
the 11-cells (centre of mass, radii of influence and quadrupole mo- 
ments), and then compute these quantities for all the larger cells in 
a bottom-up manner. Since no tesselation and sorting are required, 
the procedure is much faster than a full-scale tree build and can be 
used until one particle crosses the border of its 11-cell. At that point, 
the tree could in principle be 'repaired' by locally adjusting the 
tree structure where necessary. For now, however, we completely 
rebuild the tree whenever this 'integrity check' fails, otherwise we 
just update it. 

2.2 Tree walk 

In order to find properties (such as neighbours or gravitational 
forces) of particles in the 11-cells we have to 'walk the tree'. Recur- 
sive tree walks can be programmed elegantly and in a very compact 
way, but they are usually too inefficient for use in astrophysical ap- 
plications, since they complicate to take optimization decisions at 
compile time and each recursive call comes with its own compu- 
tational overhead. We have implemented a modified version of the 
depth-first search algorithm (Cormen et al. 2001, §22.3) which has 
been used in astrophysical applications, for example, by Dubinski 
(1996). The order in which the tree nodes are accessed is shown in 
Fig. 4. 

A tree walk requires a criterion (one for a neighbour and one for 
a gravity walk) for the decision whether to accept a node or to 
open it and examine the two daughter nodes. If an 11-cell cannot 
be accepted then all its particles are processed individually. A tree 
walk ends when all the nodes in the tree have been either opened or 
skipped. Since the root node contains all the particles, it can never 
be accepted by any criterion; hence, the tree walk always begins 
with node 2. 

2.2.1 Tree vector 

The tree needs to be walked at least once per time step for every 
11-cell that needs an update, therefore the tree walk efficiency is of 
key importance for the overall runtime. The computations involved 
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Figure 5. The 'tree vector' is a one-dimensional array which stores all the 
relevant properties of all nodes contiguously, in exactly the order in which 
they are accessed during the tree walk, see Fig. 4. This allows for a very 
cache-efficient data access. In a typical simulation with a few million parti- 
cles, the use of the tree vector can reduce the time spent for tree walks by 
as much as two orders of magnitude. 

in the neighbour search (simple comparisons) are not complicated, 
but the sheer memory span that the algorithm has to traverse (a 
tree can have millions of cells scattered throughout the RAM) can 
slow down the subroutine considerably. Therefore just accessing 
the node data in the order of the tree walk would lead to contin- 
uous and repeated cache misses for every tree walk and therefore 
would cause a serious slowdown. The tree data, however, is always 
accessed in the same order. This led us to the idea of introducing a 
'tree vector', a one-dimensional array which stores all the needed 
data (label, level, centre of mass, geometrical centre, position, size) 
of the nodes contiguously, in exactly the order in which they are 
accessed during the tree walk. This is illustrated in Fig. 5. The tree 
vector is created once per time step, directly after the tree build, 
and it is subsequently used in all the tree walks. Inefficient data ac- 
cess therefore just occurs once (in the tree vector creating stage) 
rather than hundreds of thousand times during repeated tree walks. 
The subsequent tree walks all receive their data as a contiguous 
batch of elements from the tree vector. If a node has to be opened, 
the next batch of elements, corresponding to the node's left daugh- 
ter, is read from the tree vector, and the code simply advances a 
counter; usually, chances are that this batch is already loaded in 
cache (since whole segments, and not individual numbers, are read 
from the RAM at once). If the cell has to be skipped then all its 
2 P — 2 descendants, see Sec. 2.1.5, are skipped, in which case the 
tree vector counter is simply incremented by the required amount. 
The tree walk ends when the counter reaches the end of the tree vec- 
tor. In our benchmark tests with a few million particles distributed 
according to a Sobol sequence, see Sec. 3, the use of a tree vector 
speeds up the tree walks by as much as two orders of magnitude. 

2.3 Neighbour search 

In SPH, continuous or 'smoothed' quantities at a given position are 
obtained by summing up the kernel-weighted properties of con- 
tributing particles. Similarly, gradients of fluid properties are cal- 
culated as sums over the analytically known kernel gradients (see, 
e.g., Rosswog (2009) for a recent review of the SPH method). We 
use here the 'standard' cubic spline kernel of Monaghan & Lat- 
tanzio (1985), which has compact support and vanishes outside of 
a finite radius 2h. Since contributing particles ('neighbours') need 
to be known at every time step, it is crucial to have a very efficient 
algorithm to identify them. 

2.3.1 Neighbour walk 

To determine possible neighbour candidates, we walk the tree as 
described above for every 11-cell that requires an update. During 
the tree walk we check whether a candidate cell B can be reached 
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Figure 6. The radius ft no( j c of cell A (dashed green circle) encompasses 
the kernels of all the particles in A (dotted green circles). The interac- 
tion radius /2 no de_int of cell B (dashed blue circle) encompasses all particles 
in B. If we denote the distance vector between the geometric centres of 
the cells by r g .Afl = r gJ { — r g g, the 'reachability condition' translates into 
l r g,Afl| < ^node(^) + ^nodeJnt (B) . If this test fails, cell B is discarded and all 
its children are skipped in the tree walk. 

from the cell of interest, cell A. To this end we had assigned in the 
tree building phase to each 11-cell the 'interaction radius' ft n odeW> 
which encloses the volume covered by the individual kernels of all 
its particles, see Fig. 6. Moreover, each cell B had been assigned a 
radius h node j m (B) that starting from the geometric centre of cell B 
enclosed all its particles. The particle content of cell B is included 
in a 'candidate list' for possible neighbours of particles in cell A, if 

\ r g,AB\ = \r g A - r g,B \ < Kode{A) + K ode _ iat (B), (3) 

i.e. if particles in cell A can possibly reach at least one of the par- 
ticles in cell B. Although not strictly necessary, in a subsequent 
step we cull the candidate list to the true neighbours to save some 
computational effort during the later summation stage (to calculate 
densities and the hydrodynamic derivatives). 

2.3.2 Neighbour list 

Storing a list of neighbours for each particle can be problematic: we 
typically use a neighbour number of ~ 100 per particle (Rosswog 
2009, Sec. 2.9), but under certain circumstances individual particles 
can have a much larger number of neighbours. This can happen, for 
example, if low-density regions with large smoothing lengths inter- 
act with high-density regions with small smoothing lengths. If a 
low density particle is expanding, even a small fractional increase 
of its smoothing length can lead to a large jump in the neighbour 
number if suddenly the high density region 'becomes visible'. Un- 
der such circumstances one has two possibilities: either fine-tune 
the smoothing length immediately to a value so that the neighbour 
number is in an acceptable range, which can introduce a substantial 
amount of numerical noise, or accept the large neighbour number 
for a few time steps for this particle, in which case one has to take 
care of the storage of a very long neighbour list for the respec- 
tive particle. Since such sitations are exceptions, it is a waste of 
memory to store 'worst-case-sized' lists of neighbours in a fixed 
bi-dimensional array (such that neighbours \...k of particle n are 
stored at positions ( n , 1 : k ) ); one would have to allocate the max- 
imum space for each particle, even though most of them will have 
far fewer neighbours. 

An elegant solution to this problem is to store the neighbours in 



a one-dimensional list. This comes with a slight computational 
overhead, since one has to also store, for each particle, the total 
number of neighbours and the index of its first neighbour in the 
one-dimensional list. The memory savings, on the other hand, are 
tremendous, for one has to only allocate an 'average' number of 
neighbours per particle, rather than a 'maximum' one. 

2.3.3 Parallelisation 

If the 11-cells whose neighbours have to be updated are distributed 
to different processors, the computations are independent and can 
be programmed with OpenMP work-sharing constructs. We use dy- 
namical scheduling - gradual allocation of the iterations to various 
threads as they become idle - in order to prevent load imbalance. 
Combining the results in a one-dimensional array, however, consti- 
tutes a possible bottleneck, since different threads might want to si- 
multaneously modify the same array, in which case they are queued 
in the order in which they finish their calculations. This introduces 
a waiting time of a few percent of the total neighbour search, but is 
nevertheless preferable to using a bi-dimensional array. 

2.4 Gravitational forces 

To achieve accurate gravitational forces at a moderate computa- 
tional effort, we split gravity into a near- and a far-field contribu- 
tion. The near-field component is crucial for the accuracy and it 
is obtained by a kernel-smoothed direct summation, while the far- 
field contribution is calculated via a low-order multipole expansion 
up to quadrupole order. The gravitational constant G is set to unity 
throughout the paper. 

2.4.7 Multipole acceptance criterion 

During a gravity tree walk a multipole acceptance criterion (MAC) 
decides whether a node can be accepted with its multipole moments 
or whether it needs to be further resolved into its constituents for 
higher accuracy. In the first case all descendants can be skipped, in 
the second case the tree walk advances down the tree until either 
the acceptance criterion is met, or a lowest-level cell is reached, 
in which case all its particles are added to a near-field list. After 
experiments with various criteria that have been published in the 
literature (Salmon & Warren 1994; Nelson et al. 2009) we decided 
for the simple geometric MAC introduced by Barnes & Hut (1986): 
a cell B is accepted if the opening angle under which it is seen drops 
below a prescribed accuracy parameter 6: 

Hr 

— - < 9 with H B = MAX (5/ B ). (4) 

R AB 

Here, the Sj,s are the side lengths of cell B and Rab = \>"a — 1"b\ is me 
distance between the centres of mass of cells A and B. For 9 = this 
reproduces, of course, the direct summation case. Furthermore, if a 
cell B contains potential neighbour particles, we add its constituents 
to the near-field list, and do not accept the cell, independent of the 
MAC. This leads to relatively long near-field lists (~ 800 particles), 
which come at an N 2 -price, but which are also important for high 
force accuracy. Moreover, the near-field contribution parallelizes 
perfectly well, so for large parallel calculations this becomes an ac- 
ceptable procedure. As we will show below, our algorithm is very 
efficient despite the relatively long near-field lists. The rare, poten- 
tially unbounded errors introduced by the simple geometric MAC 
and described by Salmon & Warren (1994) are avoided in our code, 
as explained in Sec. 3.1. 
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(Monaghan & Lattanzio 1985) the near gravity acceleration on par- 
ticle a then becomes 



Figure 7. Forces at particle positions are calculated via a Taylor expansion 
around the cell centres of mass. For a particle a at position r a , the acceler- 
ation f a due to cell B is obtained by Taylor expanding f(r A + Ar a ) around 



2.4.2 Far field 

The multipole expansion of the gravitational potential <3>(r) due to 
a distant node with mass M and quadrupole moments g,y reads in 
Cartesian coordinates: 
M 



^----l^Qu + Oir-l). 
U 



(5) 



Once a list of acceptable, distant nodes has been identified via a 
tree walk, the resulting accelerations can be calculated as the gra- 
dients of the truncated potential, /(r) = -V<J>(r). We calculate the 
far gravity acceleration on particles in a given 11-cell via 'cell-cell 
interaction', similar to what is done in the FMM and the tree code 
of Dehnen (2000, 2002). To this end we calculate the Taylor ex- 
pansion of the forces around the centre of mass of an 11-cell, see 
Fig. 7. For a particle a in cell A the far-gravity contribution reads to 
second-order: 



ff g {r a ) =ff z {fA +Ar fl ) ~f fg (r A )+J A Ar a + ^ArjH A Ar fl . 



(6) 



Here, J A and are the Jacobian and Hessian as evaluated at point 
r A . In three dimensions they only have 6 and 10 independent com- 
ponents, respectively, due to the equality of mixed partial deriva- 
tives. This Taylor expansion allows to evaluate the far gravity only 
once per 11-cell and ensures the approximate O (N) behaviour for 
large A'. 



2.4.3 Near field 

Our goal is to simulate a self-gravitating fluid rather than a point 
particle system. Therefore, the mutual interaction between nearby 
particles needs to be softened. We use the SPH kernel function W 
to smooth both hydrodynamics and near-field gravity in a consis- 
tent way. The kernel-softened force (e.g. Dyer & Ip 1993; Dehnen 
2001) on a particle a due to b is given by: 



m a m h „ 

Pa = 2 — *\ab e ab, 

r ab 

where the softening is mediated via 

TU = 47C r 2 W(r,h)dr. 
JO 



(7) 



(8) 



Here we have used e ah = r ah /r ah , r ab =r a -r h . Thus the gravi- 
tational force is smoothly switched off as the particles approach 
each other. For constant h, Poisson's equation shows that this force 
law corresponds to the force on a point particle a due to a density 
p(r fl ) = mi,W(r a i,,h). For the commonly used cubic spline kernel 



fng(r a ) 



with 



S(q)-. 



- £ m b S{r ab ,h)e ab 




M 2 - 



h 3 



(9) 



0^q< 1 
l^q<2 (10) 
q>2 



and q = r ab /h. To ensure exact conservation, S should be symmet- 
ric with respect to the particle labels a and b, which we enforce by 
using h = 0.5(h a + h b ). 

Alternatively, a discretized set of self-gravitating SPH equations 
can be derived consistently via a variational principle from a La- 
grangian which contains an ideal fluid contribution and additional 
self-gravity terms (Price & Monaghan 2007). Such an approach 
delivers consistently softened SPH equations and in addition also 
'gravitational grad-A' terms which are similar to the correction 
terms derived for the SPH equations (Springel & Hernquist 2002; 
Monaghan 2002). Such a treatment is beyond our current focus, but 
if desired, it can be implemented in a straight forward way. 

2.4.4 Parallelisation 

Since computing the gravitational accelerations acting on two dif- 
ferent 11-cells involves unrelated calculations, dynamically sched- 
uled OpenMP work-sharing constructs work well for the gravity 
calculations and no specific optimisation is needed. 

2.5 Comparison with other work 

Some elements of our RCB tree have been used in previous work. 
Here we will briefly summarize similarities and differences. The 
most commonly used type of tree in astrophysics is the Barnes-Hut 
oct-tree. Since we have outlined some basics above and since this 
type of tree is fundamentally different from ours, we do not further 
discuss it here. 

The only astrophysical tree that is built similar to ours (as a fcd-tree), 
is the one used in GASOLINE (Wadsley et al. 2004) which is based 
on the PKDGRAV code (Stadel 2001). It also does not build the tree 
down to the particle level, but instead down to what the authors call 
'buckets', which correspond to our 11-cells. There are a number of 
differences between GASOLINE and our tree. In GASOLINE cells 
are split according to MPS while we use CMS, see Sec. 2.1.3. We 
found the CMS splitting to be substantially more accurate in terms 
of worst case errors. The authors calculate far-gravity contributions 
per particle using a hexadecupole expansion for acceptable cells 
while we calculate it per ll-cell using a quadrupole expansion and 
obtain the forces at particle positions via a Taylor expansion. Also 
the near-gravity approach differs. They base their direct summation 
list decision on a purely cell property-based opening criterion, their 
Eq. (1), which does not involve any information about the smooth- 
ing lengths. Therefore, no consistency between hydrodynamics and 
gravity is guaranteed in the sense that hydrodynamically interacting 
particles also interact (i.e. are smoothed) gravitationally. In princi- 
ple, this can deteriorate the conservation properties, but depending 
on the chosen parameters this may or may not be relevant in practi- 
cal simulations. We pursue a more conservative (though somewhat 
more costly) strategy in the sense that reachable cells are added 
to the direct summation list, independently of the MAC. However, 
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if further optimization is desired, this could be changed by simply 
modifying one line in the gravity-walk subroutine. 
As will be shown below, our tree scales close 0(N), which is due 
to the far-force being calculated via a Taylor expansion only once 
per 11-cell. Such cell-cell interactions are at the heart of the FMM 
method (Greengard & Rokhlin 1987). The idea of cell-cell inter- 
actions, i.e. using Taylor expansions at both the 'sink' and the 
'source' side, has been used in the 0(N) tree of Dehnen (Dehnen 
2000, 2002). Contrary to our approach, Dehnen uses a standard oct- 
tree structure and makes explicit use of symmetries in the double- 
sided Taylor expansions in order to reduce the operational effort. 
Combined with a MAC that is symmetric in the properties of the 
two interacting cells this explicitely enforces momentum conserva- 
tion by ensuring that Newton's third law is obeyed. The benefits 
of this approach come at the price of larger memory consumption; 
also, the explicit use of the symmetries in Taylor expansion coef- 
ficents (cells are evaluated as both 'sinks' and 'sources') prevents 
the use of standard tree walks. Instead, Dehnen uses a two step ap- 
proach with an 'interaction' and 'evaluation' phase, see Sec. 3.2 in 
Dehnen (2002) for details. The benefit of exact conservation, how- 
ever, needs to be given up when individual time steps are used. 
Moreover, Dehnen' s tree is designed for N-body codes rather than 
for simulating self-gravitating fluids as in our case. Therefore, the 
MAC is not 'overruled' by a 'reachability' criterion as in our (con- 
servative) approach described in Sec. 2.4.1. As an effect, the near- 
field lists can be kept shorter which reduces the cost of the direct 
summation part of the code. As outlined above, such an approach 
could be easily implemented in our tree, but for now we stick to the 
more conservative (and slightly more expensive) approach. 



3 BENCHMARK RESULTS 

Motivated by possible future applications of our RCB tree code, 
we chose three sets of initial conditions: particles distributed in a 
sphere according to a quasi-random Sobol distribution (Press et al. 
1992), see Fig. 8, and two snapshots from SPH simulations: a white 
dwarf - white dwarf (WD-WD) collision with 602506 particles 
(Rosswog et al. 2009), see Fig. 9, and a WD (500677 particles) 
that has been tidally disrupted by an intermediate-mass black hole 
(IMBH) of 1000 M (Rosswog et al. 2009), see Fig. 10. The Sobol 
sequence yields particles that homogeneously fill the allowed space 
(here a sphere), but its quasi-randomness makes it a 'worst case 
scenario' for efficient memory access. The other two scenarios are 
close to the future applications of our code and they are challenging 
in terms of a complicated geometry with a very large spectrum of 
particle densities and smoothing lengths and in the sense that the 
particles are scattered across memory, i.e. spatially nearby particles 
are not close in terms of memory location. 

As explained above, our tree will become an ingredient of several 
codes which will differ in their time integration methods. There- 
fore, we only present tests of speed and force accuracy calculations 
for static particle distribution snapshots. We aim for an average rel- 
ative force error of ~ 0. 1%, but also monitor the force errors of all 
particles to make sure that even the most extreme outliers remain 
at an acceptable accuracy. This is crucial since among our intended 
applications are neutron stars which, as a result of their nearly in- 
compressibe matter equation of state at high density, exhibit very 
sharp surfaces. Particles with large errors in such a surface region 
can be disastrous for the whole simulation. 

For all tests we compiled the serial code with the Intel Fortran 9.1 
compiler, using the -03 optimisation flag and the double precision 




e = |8a|/| a | 

Figure 11. Cumulative error curves for 9 = 0.4,0.5,0.6,0.7,0.8,0.9 and 
1.0, showing the fraction P(< e) of particles with relative errors |Sa|/|a| 
smaller than the value e given on the x axis. This figure has been obtained 
for the particle distribution of a WD-WD encounter (N = 602506, = 0.7). 
All other cases yield very similar distributions. 

flag -real-size 64. The code was executed on an Intel Xeon 
E5420 processor running at 2.50 GHz, with 6 MB of L2 cache and 
8 GB of RAM. 

3.1 Accuracy 

As a first test we explored the sensitivity of the force accuracy to 
the parameter 9. The cumulative distribution of relative errors in the 
acceleration, £,- = |8a,-|/|«;|, was investigated for in a range from 
0.4 to 1 . Our RCB tree results are compared against direct, kernel- 
smoothed summations according to Eq. (9) which are correct to 
double precision accuracy (15 significant digits). The cumulative 
error distributions are plotted in Fig. 11. For 6 = 0.7, 99% of the 
particles have relative errors less than 0.2%. Fig. 12 shows in de- 
tail the relation between the relative force errors £,■ and the absolute 
values of the force, |a,-|, in a typical simulation. Since the plots for 
our test cases are quantitatively very similar, we only present the 
results for the WD-WD encounter. Also, since the particles with 
£ < 0.2% are uniformly distributed across the entire force spec- 
trum, we do not plot them. The plot demonstrates that all the par- 
ticles with large relative errors only experience a very weak total 
gravitational force. 

The above accuracy is acceptable for most astrophysical applica- 
tions, therefore we make 9 = 0.7 our default value. We find that 
the accuracy plots for all three particle distributions are virtually 
identical, therefore we display only the results for the WD-WD 
encounter. This robustness with respect to the geometry of the par- 
ticle distribution underlines the versatility of our approach and is a 
highly desired property for an astrophysical simulation algorithm. 
In all of the cases, the very few particles with relative errors above 
0.2% feel an essentially zero net force. Formally, of course, this 
relative error can diverge (for exactly |a, | = 0), but in practice this 
does not have any influence since the particles just do not move 
during a time step. 

Salmon & Warren (1994) showed that the conventional geomet- 
ric MAC, see Eq. (4), can introduce unbounded errors in certain 
pathological cases (see their Appendix A for the 'detonating galax- 
ies' scenario). In our RCB tree, these situations are avoided in two 
ways. On the one hand, the CMS described in Sec. 2.1 protects 
against having very distant particles in the same cell. On the other 
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Figure 8. A particle distribution within a sphere, 
obtained via a Sobol quasi-random sequence. 
Such a particle distribution is homogeneous, but 
-due to its randomness- represents a 'worst case 
scenario' for efficient memory access. 
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Figure 9. Snapshot from a white dwarf - white 
dwarf off-centre collision (602506 particles). The 
smoothing lengths span a range of over three or- 
ders of magnitude, due to the large density dis- 
crepancy between the stellar centers and the ac- 
cretion flow between them. 
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Figure 10. Snapshot from a tidal disruption 
of a white dwarf (500677 particles) by an 
intermediate-mass black hole of 1000 M . The 
large range of smoothing lengths, five orders of 
magnitude, make the force calculation challeng- 
ing. 
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Absolute value of the force [code units] 

Figure 12. The relative error e = |8o|/|a| plotted against the absolute value 
of the force, \a\. Accuracies of £ < 0.2% are obtained for particles across the 
full force spectrum (for over 99% of all particles). These particles are not 
shown here, instead, we focus in this plot on the outliers. Only particles with 
very small forces exhibit force errors in excess of 1%. This figure has been 
obtained for the particle distribution of a WD-WD encounter (N = 602506, 
8 = 0.7). All other cases yield very similar distributions. 

hand, since the particle content of each reachable cell is summed 
up directly and independent of the MAC, we are protected against 
potentially unbounded errors from particles that get very close to 
the sink but whose cells are, for some reason, still accepted by the 
MAC. 



3.2 Performance 

3.2.1 Optimizing the tree depth 

We experimentally optimize the depth of our RCB tree, which is de- 
termined by the average number of particles per 11-cell, Nu . Sample 
results for a spherical Sobol distribution with N = 500000 are pre- 
sented in Table 1 . As expected, the tree build becomes faster as the 
height of the tree decreases, but in none of the cases the tree build 
takes more than 1% of the total computing time. The neighbour 
search and the gravity calculations depend strongly on the height 
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Table 1. Scaling of the RCB tree with the average number of particles per 
11-cell, Nu. The following abbreviations are used: L: number of tree levels; 
P<1%: fraction of particles with an error smaller than 1 %; TB: tree build, 
N: neighbour search, NG: near-field gravity, FG: far-field gravity, G; total 
gravity. Times are measured in seconds. 



Nu L P <1% TB [s] N [s] NG [s] FG [s] G[s] 



1 


19 


0.9998 


0.76 


23.54 


6.14 


99.26 


171.47 


3 


18 


0.9998 


0.51 


10.90 


7.15 


39.76 


75.52 


6 


17 


0.9998 


0.36 


6.67 


10.16 


16.12 


37.83 


12 


16 


0.9998 


0.29 


5.86 


14.66 


6.66 


25.90 


24 


15 


0.9997 


0.25 


7.07 


22.02 


2.71 


26.67 


48 


14 


0.9997 


0.24 


10.56 


36.22 


1.02 


38.13 


64 


13 


0.9996 


0.22 


14.97 


56.90 


0.50 


57.88 


128 


12 


0.9997 


0.19 


26.12 


101.51 


0.18 


102.04 



The test was performed on a spherical Sobol distribution with 500 000 parti- 
cles and smoothing lengths chosen so that the average number of neighbours 
was 105. With our standard value for the accuracy parameter, 9 = 0.7, the 
average relative error £ ~ 0. 1 % for all test cases. Particles with errors larger 
than 1 % are essentially force free (their force is about five orders of magni- 
tude smaller than the average force). In an extensive set of tests we always 
found optimal results for Nu ~ 12, regardless of the number of particles or 
their distribution. 

of the tree: larger 11-cells mean a shallower tree and thus faster tree 
walks and far-gravity, while the near-gravity neighbour candidate 
lists become longer and increasingly more expensive to evaluate. 
The balance between the two regimes was empirically found at 
Nu ^ 12 regardless of the number of particles or the complexity of 
their spatial distribution. The performance is rather robust against 
substantial changes in this number, e.g. doubling this value to 24 
yields essentially the same performance. All of our subsequent test 
calculations were obtained with Nu = 12. 

3.2.2 Comparison with the Press tree 

We chose to compare the RCB tree with a commonly used binary 
tree that is due to Press (Press 1986; Benz et al. 1989), mainly 
because we have used this tree for years and because this is the 
tree that our new RCB tree will replace. The version used in 
this comparison has been tuned by collecting quantities that are 
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Figure 13. Comparison between the Press and the RCB tree in terms of 
the tree building time and of the cumulative time required for neighbour 
search and gravity calculations. The particle distributions for this test were 
spherical Sobol distributions with up to 4 million particles. The smooth- 
ing lengths were chosen such that the average neighbour number was 105. 
The plot shows a cross-over point for the gravity + neighbours timings at 
N S3 50000: for very low particle numbers, the complex infrastructure of the 
RCB tree renders it slower (up to a factor of two) than the Press tree. 



frequently used together into common arrays (Rosswog & Price 
2007). This has improved the overall performance for completely 
unsorted particle arrays by nearly a factor of three with respect 
to the original version. We have chosen the accuracy parameters 
of the two trees so that we obtain overall very similar accuracies, 
an average relative error of 6 ~ 0.1% and a maximum error 
Emax ;$ 1%. but with the RCB tree being slightly more restrictive 
and producing slightly higher accuracies. This motivated our 
choice of 0p reS s = 0.5 for the Press and Orcb = 0.7 for our RCB 
tree. The larger 0-value for the RCB tree has two explanations: the 
Press tree has tight bounds by construction (Anderson 1993) and 
thus exhibits physically smaller nodes, and the MAC in our RCB 
tree is more conservative in the sense that it chooses the longest 
cell edge as a measure of the cell size, see Eq. (4). 

Tree build 

In a first test we compare the performance during the tree building 
phase. Due to the more complicated algorithm for identifying 
mutual nearest neighbours, building the Press tree from scratch is 
quite time consuming. In fact, from all the trees frequenty used in 
astrophysics, the Press tree is probably the most expensive one to 
build. As test cases we chose spherical Sobol distributions with 
smoothing lengths so that the average neighbour number was 105 
(this is conservative for the gravity calculation with our RCB 
tree algorithm since it uses large particle numbers in the direct, 
near-gravity summation and the results would be even stronger 
in favour of the RCB tree for smaller neighbour numbers). The 
results of this test are presented in Fig. 13 with filled (open) 
squares for the RCB (Press) tree. Here, the RCB tree turns out to 
be substantially faster than the Press tree, for 4 x 10 6 particles (the 
maximum we could afford for our version of the Press tree) on one 
processor our RCB tree build is already ~ 25 times faster than the 
Press tree with increasing discrepancy for larger particle numbers. 

Neighbour search and gravity 

The second performance measure is the sum of the times for neigh- 
bour search and gravity. The reason for summing them up is that 
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Figure 14. Comparative timings of the RCB tree code components (tree 
build, neighbour search and gravity calculations) for spherical Sobol distri- 
butions with up to 10 8 particles and ~ 105 neighbours per particle. Since 
the far-field gravity calculation does not depend on the particle number, but 
only on the tree depth the 'far gravity' in the plot remains essentially con- 
stant until the number of tree levels is incremented. The tree build always 
takes less than 1 % of the execution time, the gravity calculation typically 
takes 80-90% percent, and the neighbour search occupies the remaining 10- 
20%. 



our version of the Press tree performs both operations in the same 
loop for increased performance. We decided to keep the RCB tree 
code flexible and separate the two operations in different subrou- 
tines that can be called independently. One would therefore expect 
the RCB code to perform even better if one was willing to sacrifice 
this flexibility and to calculate both in a single loop. The RCB tree 
outperforms the Press tree in this test already at N ~ 50000, see 
the filled triangles for the RCB and the open triangles for the Press 
tree. Up to this point its relatively complex infrastructure makes 
the RCB tree slower (up to a factor of two, for these particle num- 
bers corresponding to fractions of a second on one processor). Near 
4 x 10 particles, however, RCB neighbour search and gravity are 
faster by about a factor of 6 with larger discrepancies for increasing 
particle numbers. In this and the subsequent plot one notices little 
'glitches' in the RCB curves. They occur whenever the height of 
the tree changes, since tree walks and far-field calculations, being 
performed 'per U-celT rather than 'per particle', only depend on the 
number of 11-cells and not directly on the number of particles. 

3.2.3 Behaviour for large N 

In order to test the robustness and the scaling behaviour of our RCB 
tree we computed, on one processor, the self-gravity of Sobol dis- 
tributions with increasingly larger particle numbers, ending only at 
10 8 particles. Successfully running such a simulation on a machine 
with only 8 GB of RAM gives an upper limit on the memory con- 
sumption of our code of approximately 86 bytes per particle. This 
includes not only the arrays that store particle properties, but also 
the ones that store the nodes, the temporary variables, the subrou- 
tines, the files read in memory, and every other bit of RAM that our 
code uses. 

In Fig. 14 we compare the time spent for different operations. The 
tree build becomes completely negligible as the number of parti- 
cles increases (even in the worst case it never surpasses 1% of the 
total execution time). The tree build in the Press tree has been one 
of the most expensive components, see also Nelson et al. (2009), 
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Figure 15. CPU time spent per particle in neighbour search, gravity calcu- 
lations, and cumulative neighbour search + gravity calculations for spheri- 
cal Sobol distributions with up to 10 s particles and ~ 105 neighbours per 
particle. For a given number of levels the RCB tree scales slightly better 
than O (N) , and the execution time only increases by a constant (the plot is 
shifted upwards) when another level is added to the tree. Overall, the exe- 
cution time per particle increases much slower than the log(/V) expected for 
standard tree codes. When increasing the particle number by a factor of 10 4 
the execution time per particle increases merely by a factor of two. 



and has been avoided whenever possible, usually by 'revising' the 
tree and amortising the cost of the tree build over a number of time 
steps. This, however, can become difficult: simulations involving 
black holes where particles are frequently absorbed at the horizon 
require a frequent tree build (if they orbit safely inside the horizon 
they can be temporarily stored in a special list and be removed only 
later, see Rosswog (2005), however at a substantial bookkeeping 
effort). Furthermore, since the tree build is usually the component 
of the code with the worst parallel scaling, having a fast tree build 
algorithm becomes crucial when parallelising the code. Therefore, 
the tree build is, together with the scaling behaviour close to O (N), 
one of the strongest points of the RCB tree. The time for neighbour 
search takes 10-20%, the remaining 80-90% are invested for the 
gravitational forces. 

In Fig. 15 we display the CPU time spent per particle in differ- 
ent parts of the code. This figure shows the 0(N) scaling of the 
neighbour search, even at 10 8 particles (the time spent searching 
the neighbours of one particle is always ~ 13 (J.s). At a given tree 
depth the gravity calculation scales slightly better than O (N) for a 
given height of the tree (i.e., as more particles are added to the same 
cells, the CPU time per particle drops). This is partly an effect of 
the grouped tree traversal: since one tree walk and one 'far force' 
evaluation are performed per 11-cell, these two components of the 
code do not depend on the number of particles, but only on the 
height of the tree. This makes filling an 11-cell with more and more 
particles increasingly more efficient in terms of tree-traversals and 
far-gravity calculations (such behaviour can occur in any tree code, 
since adding one more level to a tree always increases the time 
needed by the tree walk, regardless of whether it is done per parti- 
cle or per 11-cell). The near- gravity calculation, however, becomes 
more expensive since a larger number of particles is summed up 
directly. The total time per particle spent with tree operations is 
increasing somewhat, but at a much slower pace than the log(iV)- 
behavior encountered for standard tree codes. When increasing the 
total particle number by a factor of 10 4 the time per particle merely 
increases by a factor of two. 
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Figure 16. Scaling of the total CPU time (tree build + neighbour search + 
gravity calculations) with the number of processors. This plot also shows, 
with dashed lines, the ideal scaling for each of the five test cases. Not only 
are the results very close to the ideal scaling, but they improve as the number 
of particles N increases. 



3.2.4 OpenMP parallelisation 

The parallel code was compiled and executed on an SGI Altix 
3700Bx2 machine with 24 Itanium2 processors (Madison9M) run- 
ning at 1 . 6 GHz, with 6 MB of L2 cache and 96 GB of shared RAM. 
We tested the scaling of the RCB tree code with both the number of 
processors and particles. We chose three Sobol distributions (with 
1, 4 and 8 million particles, respectively) and the two snapshot tests 
described above; the corresponding results are presented in Fig. 16. 
The results obtained in this test are rather robust against changes 
in the particle distribution and even particle number. In all cases a 
speedup of > 21 was obtained on 24 processors. 



4 CONCLUSIONS 

We have introduced a new tree for neighbour search and gravity that 
is based on recursive coordinate bisection. This binary RCB tree is 
only built down to an optimal depth so that there are still about 
12 particles left per lowest-level cell. This property makes the tree 
fast (since some operations are only performed per lowest-level cell 
rather than per particle) and memory efficient since fewer aggre- 
gated quantities need to be stored for such a shallow tree. Gravity is 
split into a near- and far-field component; the first is calculated via a 
direct, kernel-smoothed summation while the latter uses a 'cell-cell 
interaction' based on Cartesian multipole and Taylor expansion. 
We have compared the performance of our RCB tree against that 
of the 'Press tree' that we had used earlier on various occasions. 
As expected for such a 'top-down' tree, the tree building phase is 
subtantially faster for RCB tree. At four million particles the tree 
build is faster by about a factor of 25, neighbour search and gravity 
(at slightly higher accuracy for the RCB tree) are faster by a fac- 
tor of six. These ratios become even more favorable for the RCB 
tree with increasing particle numbers. The code was tested for up 
to 10 8 particles on a single processor, showing very good scaling 
behaviour close to O (N) and a low memory consumption. 
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